Functional Value Iteration for Decision-Theoretic Planning with General Utility Functions

Authors

  • Yaxin Liu
  • Sven Koenig
Abstract

We study how to find plans that maximize the expected total utility for a given MDP, a planning objective that is important for decision making in high-stakes domains. The optimal actions can now depend on the total reward that has been accumulated so far in addition to the current state. We extend our previous work on functional value iteration from one-switch utility functions to all utility functions that can be approximated with piecewise linear utility functions (with and without exponential tails) by using functional value iteration to find a plan that maximizes the expected total utility for the approximate utility function. Functional value iteration does not maintain a value for every state but a value function that maps the total reward that has been accumulated so far into a value. We describe how functional value iteration represents these value functions in finite form, how it performs dynamic programming by manipulating these representations, and what kinds of approximation guarantees it is able to make. We also apply it to a probabilistic blocksworld problem, a standard test domain for decision-theoretic planners.

Introduction

Decision-theoretic planning researchers believe that Markov decision process models (MDPs) provide a good foundation for decision-theoretic planning (Boutilier, Dean, & Hanks, 1999; Blythe, 1999). Typically, they find plans for MDPs that maximize the expected total reward. For this planning objective, the optimal actions depend on the current state only. However, decision-theoretic planning researchers also believe that it is sometimes important to maximize the expected utility of the total reward (= expected total utility) for a given monotonically nondecreasing utility function. For example, utility theory suggests that human decision makers maximize the expected total utility in single-instance high-stakes planning situations, where their utility functions characterize their attitude toward risk (von Neumann & Morgenstern, 1944; Pratt, 1964). Examples of such situations include environmental crisis situations (Cohen et al., 1989; Blythe, 1998), business decision situations (Murthy et al., 1999; Goodwin, Akkiraju, & Wu, 2002), and planning situations in space (Pell et al., 1998; Zilberstein et al., 2002), all of which are currently solved without taking risk attitudes into consideration. The question then arises how to find plans for MDPs that maximize the expected total utility. This is a challenge because the optimal actions can then depend on the total reward that has been accumulated so far (= the current wealth level) in addition to the current state (Liu & Koenig, 2005b).

We showed in previous publications that, in principle, a version of value iteration can be used to find a plan that maximizes the expected total utility for an arbitrary utility function if it maintains a value for every pair of state and wealth level (Liu & Koenig, 2005b). We then developed such a version of value iteration, functional value iteration, that maintains a value function for every state that maps wealth levels into values. Functional value iteration is practical only if there exist finite representations of these value functions. We applied functional value iteration to one-switch utility functions, for which the value functions are piecewise one-switch and thus can be represented in finite form (Liu & Koenig, 2005b).
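Read as a sketch, the recursion that functional value iteration computes can be written as follows; the notation is ours rather than quoted from the paper and assumes the standard MDP ingredients (transition probabilities P(s' | s, a), transition rewards r(s, a, s'), utility function U, goal states G, and horizon T):

    V_t(s, w) = U(w)                                                        if s ∈ G or t = T,
    V_t(s, w) = max_a Σ_{s'} P(s' | s, a) · V_{t+1}(s', w + r(s, a, s'))    otherwise.

The object stored at each state s is thus the mapping w ↦ V_t(s, w) over wealth levels w, and it is this mapping that must be represented in finite form, for example piecewise linearly (possibly with exponential tails).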
In this paper, we develop a more general approach that approximates a large class of utility functions with piecewise linear utility functions (with and without exponential tails). It then uses functional value iteration to find a plan that maximizes the expected total utility for the approximate utility function. We describe how functional value iteration can represent the value functions in finite form, how it performs dynamic programming by manipulating these representations, and what kinds of approximation guarantees it is able to make. We then apply it to a probabilistic blocksworld problem to demonstrate how the plan that maximizes the expected total utility depends on the utility function.

Decision-Theoretic Planning

We perform decision-theoretic planning on MDPs with action costs and want to find a plan that maximizes the expected total utility until plan execution stops, which only happens when a goal state has been reached. We now define our MDPs and this planning objective more formally. Our MDPs consist of a finite nonempty set of states S, a finite nonempty set of goal states G ⊆ S, and a finite nonempty set of actions A for each nongoal state s ∈ S \ G. An agent is given a time horizon 1 ≤ T ≤ ∞. The initial time step is t = 0. Assume that the agent is in state s_t ∈ S at time step t. If t = T or s_t is a goal state, then the agent stops executing actions, which implies that it no longer receives rewards in the future. Otherwise, it executes
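To make the objective concrete, the following minimal sketch runs functional value iteration on a tiny two-state example. It is our illustration rather than the paper's algorithm: wealth levels are kept on a discrete grid instead of being represented exactly as piecewise linear value functions, the utility function is a fixed risk-averse exponential, and all names in the code (STATES, GOALS, ACTIONS, snap, functional_value_iteration) are invented for this example.

import math

# Illustrative sketch only (not the paper's algorithm): functional value
# iteration on a two-state MDP with action costs, with wealth levels
# discretized on a grid rather than represented as piecewise linear functions.

STATES = ['s', 'g']
GOALS = {'g'}
# For the nongoal state 's': each action maps to a list of
# (probability, successor state, reward) outcomes.
ACTIONS = {
    's': {
        'safe':  [(1.0, 'g', -2.0)],                    # pay 2, reach the goal for sure
        'risky': [(0.5, 'g', -1.0), (0.5, 's', -1.0)],  # pay 1, reach the goal half the time
    }
}

def U(w):
    # Monotonically nondecreasing, risk-averse utility of the total reward w.
    return -math.exp(-0.3 * w)

WEALTH_GRID = [0.5 * i for i in range(-40, 1)]  # accumulated reward from -20.0 to 0.0

def snap(w):
    # Map an accumulated reward onto the nearest grid point (clamps at the ends).
    return min(WEALTH_GRID, key=lambda x: abs(x - w))

def functional_value_iteration(horizon):
    # value[s][w] approximates the expected total utility achievable from state s
    # when the total reward w has already been accumulated; goal states return U(w).
    value = {s: {w: U(w) for w in WEALTH_GRID} for s in STATES}
    for _ in range(horizon):
        new_value = {s: dict(value[s]) for s in STATES}
        for s in STATES:
            if s in GOALS:
                continue  # plan execution stops in goal states
            for w in WEALTH_GRID:
                new_value[s][w] = max(
                    sum(p * value[s2][snap(w + r)] for p, s2, r in outcomes)
                    for outcomes in ACTIONS[s].values()
                )
        value = new_value
    return value

if __name__ == '__main__':
    v = functional_value_iteration(horizon=50)
    # Expected total utility of starting in 's' with nothing accumulated yet.
    print(v['s'][0.0])

In this toy example both actions have the same expected total reward (-2), so a risk-neutral planner would be indifferent between them, but the risk-averse utility above makes the deterministic 'safe' action strictly better; with a risk-seeking utility the preference flips. This dependence of the best plan on the utility function is what the paper's blocksworld experiments illustrate on a larger scale.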


Similar Articles

Probabilistic Planning with Risk-Sensitive Criterion

Probabilistic planning models and, in particular, Markov Decision Processes (MDPs), Partially Observable Markov Decision Processes (POMDPs) and Decentralized Partially Observable Markov Decision Processes (Dec-POMDPs) have been extensively used by AI and Decision Theoretic communities for planning under uncertainty. Typically, the solvers for probabilistic planning models find policies that min...


Risk-Sensitive Planning with One-Switch Utility Functions: Value Iteration

Decision-theoretic planning with nonlinear utility functions is important since decision makers are often risk-sensitive in high-stake planning situations. One-switch utility functions are an important class of nonlinear utility functions that can model decision makers whose decisions change with their wealth level. We study how to maximize the expected utility of a Markov decision problem for ...


Risk-Sensitive Planning in Partially Observable Environments

Partially Observable Markov Decision Process (POMDP) is a popular framework for planning under uncertainty in partially observable domains. Yet, the POMDP model is risk-neutral in that it assumes that the agent is maximizing the expected reward of its actions. In contrast, in domains like financial planning, it is often required that the agent's decisions are risk-sensitive (maximize the utility o...



An exact algorithm for solving MDPs under risk-sensitive planning objectives with one-switch utility functions

One-switch utility functions are an important class of nonlinear utility functions that can model human beings whose decisions change with their wealth level. We study how to maximize the expected utility for Markov decision problems with given one-switch utility functions. We first utilize the fact that one-switch utility functions are weighted sums of linear and exponential utility functions ...




Publication year: 2006